{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Discussion 2: Test Sets and Validation\n",
    "\n",
    "## Introduction\n",
    "\n",
    "The typical supervised training problem goes like this:\n",
    "- There is a set $\\mathcal{X}$ of *inputs* and a set $\\mathcal{Y}$ of *outputs*\n",
    "- There is a function $f \\colon \\mathcal{X} \\to \\mathcal{Y}$ that *we would like to learn*\n",
    "- We have access to a large amount of data $D = \\{(x_i, y_i)\\}_{i =1}^n$ such that $y_i = f(x_i) + \\varepsilon_i$, where $\\varepsilon_i$ is noise\n",
    "- We have an algorithm $A$ that takes our data as input and gives us a function $g \\colon \\mathcal{X} \\to \\mathcal{Y}$, i.e., $g = A(D)$, such that $g(x_i) \\approx y_i$ for all $i$, i.e., $g$ matches our data well\n",
    "- But what we really want is $g(x) \\approx f(x)$ for all $x \\in \\mathcal{X}$\n",
    "\n",
    "### The test set\n",
    "- How do we estimate the ability of $g$ to approximate $f$ on values of $x$ unseen during training?\n",
    "- The typical way to do this is to split $D$ into two disjoint sets: $D = D_r \\cup D_e$, where $D_r$ is called the *training set*, and $D_e$ is called the *test set*\n",
    "- We use the training set in order to learn the function: $g = A(D_r)$\n",
    "- And we use the test set to estimate how well $g$ does on data it didn't see during training. The hope is that the *test error* will be close to the *true error* of the approximation\n",
    "- The test set $D_e$ is only used to estimate the *true error* of $g$\n",
    "\n",
    "#### Example\n",
    "Consider the function\n",
    "$\n",
    "f(x) = e^{\\frac{x-a}{b}}\n",
    "$\n",
    "for some constants $a$ and $b$.\n",
    "We will do the following\n",
    "- Sample points $\\{x_i\\}_{i=1}^n$ in the interval $[-1,1]$, evaluate $f$ at those points, and add noise: $f(x_i) + \\varepsilon_i$\n",
    "- Extract a subset of those points and call this set of points $D$. This is the data from which we can learn an approximation\n",
    "- Split $D$ into a training set $D_r$ and a test set $D_e$\n",
    "- Use linear regression to fit the data $D_r$ to a specific set of features\n",
    "- Then we will check the error of the estimate on the test set, and on the entire set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "import ipywidgets as widgets\n",
    "from ipywidgets import interactive, fixed\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# we sample the x values uniformly\n",
    "N = 5000\n",
    "seed = 10000\n",
    "np.random.seed(seed)\n",
    "xVals = np.sort(2*np.random.rand(N)-1.0).reshape(-1,1)\n",
    "\n",
    "# evaluate the true function values\n",
    "# differentiate between that and the noisy observations\n",
    "noise_weight = 2\n",
    "yVals = np.exp((xVals + 1.0)/1.0)\n",
    "yValsNoisy = yVals + noise_weight*np.random.randn(N,1)\n",
    "\n",
    "# plot the noisy data that we will use for training and the\n",
    "# original function that we are trying to learn\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xVals, yValsNoisy, '.', xVals, yVals, '-')\n",
    "ax.set(xlabel='x', ylabel='y')\n",
    "ax.legend(['Noisy Values', 'Original function'],loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# From the true data, someone extracts a subset and gives it to us. In this case,\n",
    "# this person samples the data randomly\n",
    "n = int(N*0.1)\n",
    "np.random.seed(seed)\n",
    "indices = np.sort(np.random.choice(N, n))\n",
    "xGiven = xVals[indices].reshape((-1,1))\n",
    "yGiven = yValsNoisy[indices].reshape((-1,1))\n",
    "yTrueAtGivenx = yVals[indices].reshape((-1,1))\n",
    "\n",
    "# let's plot the data we are given\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xGiven, yGiven, '.', xGiven, yTrueAtGivenx, '-')\n",
    "ax.set(xlabel='x', ylabel='y')\n",
    "plt.title(\"The data given to us: \" + str(n) + \" points\")\n",
    "ax.legend(['Data given to us', 'Original function'],loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# we split the data given to us into a training set and a test set\n",
    "def trainTestPartition(xValues, yValues):\n",
    "    n = len(xValues)\n",
    "    trainIndices_num = np.sort(np.random.choice(n, int(0.8*n), replace=False))\n",
    "    # Now index using Boolean arrays\n",
    "    testIndices = np.ones((n,1),bool)\n",
    "    testIndices[trainIndices_num] = False\n",
    "    trainIndices  = np.zeros((n,1),bool)\n",
    "    # We just set all trainIndices to zero. This will give us an empty\n",
    "    # training set. Add code to fix the assignment to trainIndices.\n",
    "    # Hint: you already have an assignment for testIndices\n",
    "    ### start trainIndices ###\n",
    "    ### end trainIndices ###\n",
    "\n",
    "    xTrain = xValues[trainIndices].reshape(-1,1)\n",
    "    yTrain = yValues[trainIndices].reshape(-1,1)\n",
    "    xTest = xValues[testIndices].reshape(-1,1)\n",
    "    yTest = yValues[testIndices].reshape(-1,1)\n",
    "    \n",
    "    return xTrain, yTrain, xTest, yTest\n",
    "\n",
    "# seed the random number generator\n",
    "np.random.seed(seed)\n",
    "\n",
    "# we will now fit a polynomial to this data\n",
    "# In principle, we don't know how the data was generated. We will try to fit the data with a \n",
    "# degree-5 polynomial\n",
    "def featurizeVector(xVec):\n",
    "    return np.concatenate((np.ones(xVec.shape), xVec, xVec**2, xVec**3,\n",
    "                           xVec**4, xVec**5), axis=1)\n",
    "\n",
    "def predict(reg, xVec):\n",
    "    A = featurizeVector(xVec)\n",
    "    return reg.predict(A)\n",
    "\n",
    "def fit(xAry, yAry):\n",
    "    A = featurizeVector(xAry)\n",
    "    return LinearRegression().fit(A, yAry)\n",
    "\n",
    "# we partition the data given to us into two sets: train and test\n",
    "xTrain, yTrain, xTest, yTest = trainTestPartition(xGiven, yGiven)\n",
    "\n",
    "# solve the least squares problem A w = yGiven\n",
    "reg = fit(xTrain, yTrain)\n",
    "\n",
    "# report the training error\n",
    "y_train_predictions = predict(reg, xTrain)\n",
    "trainErr = mean_squared_error(y_train_predictions, yTrain)\n",
    "print(\"Training error: \" + str(trainErr))\n",
    "\n",
    "# report the test error\n",
    "y_test_predictions = predict(reg, xTest)\n",
    "testErr = mean_squared_error(y_test_predictions, yTest)\n",
    "print(\"Test error: \" + str(testErr))\n",
    "\n",
    "# because we have access to the entire dataset, we can also check\n",
    "# the true error on the dataset\n",
    "y_vals_predictions = predict(reg, xVals)\n",
    "trueErr = mean_squared_error(y_vals_predictions, yVals)\n",
    "print(\"True error: \" + str(trueErr)) \n",
    "\n",
    "# plot the predictions of the polynomial over the entire interval [-1,1]\n",
    "# and also plot the original function\n",
    "yAllPredictions = predict(reg, xVals)\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xVals, yAllPredictions, '.', xVals, yVals, '-')\n",
    "ax.set(xlabel='x', ylabel='y', title='Function approximation')\n",
    "ax.legend(['Function learned', 'Original function'],loc='lower right')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **What can you say about these errors? Are they what you would expect?**\n",
    "- **Does the train error closely track the test error?**\n",
    "- **Is the test error close to the true error?**\n",
    "- **How many significant digits do you have?**\n",
    "\n",
    "Suppose that the data we are given is now not randomly sampled. We'll see what happens..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now we are given all data for x values larger than 0.2\n",
    "indices = xVals > 0.2\n",
    "xGiven = xVals[indices].reshape((-1,1))\n",
    "yGiven = yValsNoisy[indices].reshape((-1,1))\n",
    "yTrueAtGivenx = yVals[indices].reshape((-1,1))\n",
    "\n",
    "# let's plot the data we are given\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xGiven, yGiven, '.', xGiven, yTrueAtGivenx, '-')\n",
    "ax.set(xlabel='x', ylabel='y')\n",
    "plt.title(\"The data given to us\")\n",
    "ax.legend(['Data given to us', 'Original function'],loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "np.random.seed(seed)\n",
    "\n",
    "# we partition the data given to us into two sets\n",
    "xTrain, yTrain, xTest, yTest = trainTestPartition(xGiven, yGiven)\n",
    "\n",
    "# solve the least squares problem A w = yGiven. This will fit a function\n",
    "# w_0 + w_1 x + w_2 x^2 + w_3 x^4 + w_4 x^5\n",
    "reg = fit(xTrain, yTrain)\n",
    "\n",
    "# report the training error\n",
    "y_train_predictions = predict(reg, xTrain)\n",
    "trainErr = mean_squared_error(y_train_predictions, yTrain)\n",
    "print(\"Training error: \" + str(trainErr))\n",
    "\n",
    "# report the test error\n",
    "y_test_predictions = predict(reg, xTest)\n",
    "testErr = mean_squared_error(y_test_predictions, yTest)\n",
    "print(\"Test error: \" + str(testErr))\n",
    "\n",
    "# because we have access to the entire dataset, we can also check\n",
    "# the true error on the dataset\n",
    "y_vals_predictions = predict(reg, xVals)\n",
    "trueErr = mean_squared_error(y_vals_predictions, yVals)\n",
    "print(\"True error: \" + str(trueErr)) \n",
    "\n",
    "# plot the predictions of the polynomial over the entire interval [-1,1]\n",
    "# and also plot the original function\n",
    "yAllPredictions = predict(reg, xVals)\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xVals, yAllPredictions, '.', xVals, yVals, '-')\n",
    "ax.set(xlabel='x', ylabel='y', title='Function approximation')\n",
    "ax.legend(['Function learned', 'Original function'],loc='upper right')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **Again, does the train error closely track the test error?**\n",
    "- **Is the test error close to the true error?**\n",
    "- **Was there a difference between the behavior of these errors as we were given different datasets? Comment on the differences**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The validation set\n",
    "### Hyperparameters\n",
    "- In the example we just saw, we fixed from the beginning the features whose coefficients we wanted to train. We said that we would learn a function of the form $\\sum_{i = 0}^5 w_i x^i$\n",
    "- In general, we don't have a polynomial order fixed in advance, but we test several orders, and make a choice\n",
    "- This means that in our previous framework, the algorithm $A$ usually does not depend only on data, but also depends on some fixed values $\\phi$ that we call *hyperparameters*. Example of hyperparameters are the number of features to use or the value of constants used for regularization. Hyperparameters are values that affect the choice of the model that is learned, but which are not themselves learned within our inner training algorithms\n",
    "- Suppose we have $m$ values that we want to evaluate for those hyperparameters. This means that we will learn $m$ different functions, one for each value:\n",
    "$$ g_j = A(D_r; \\phi_j)$$\n",
    "for $j = 1, \\ldots, m$. How do we choose which $g_j$ to keep?\n",
    "- Here's where a validation set comes into play\n",
    "\n",
    "### Validation set\n",
    "- We are given the data $D$ and make a partition of this set into three disjoint sets: $D = D_r \\cup D_e \\cup D_v$, the train set, the test set, and the validation set\n",
    "- We train the $m$ functions $g_j$ using the train set $D_r$\n",
    "- The *validation set* $D_v$ is used to make a decision between the various $g_j$: the $g_j$ with the smallest *validation error* wins\n",
    "- After we pick the model (i.e., the $g_j$) with lowest validation error, we use the test set to evaluate the test error as a proxy for the true error\n",
    "- *The validation set is used to decide how to set hyperparameters, but the test set cannot be used to influence a choice of model. The test set is used once*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# From the true data, someone extracts a subset and gives it to us. In this case,\n",
    "# this person samples the data randomly\n",
    "n = int(N*0.25)\n",
    "np.random.seed(seed)\n",
    "indices = np.sort(np.random.choice(N, n))\n",
    "xGiven = xVals[indices].reshape((-1,1))\n",
    "yGiven = yValsNoisy[indices].reshape((-1,1))\n",
    "yTrueAtGivenx = yVals[indices].reshape((-1,1))\n",
    "\n",
    "# let's plot the data we are given\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(xGiven, yGiven, '.', xGiven, yTrueAtGivenx, '-')\n",
    "ax.set(xlabel='x', ylabel='y')\n",
    "plt.title(\"The data given to us: \" + str(n) + \" points\")\n",
    "ax.legend(['Data given to us', 'Original function'],loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we split the data given to us into a training set, a validation set, and a test set\n",
    "# train on 60% of the data, and keep 20 % of the data for validation and 20% for test \n",
    "def trainTestValidationPartition(xValues, yValues):\n",
    "    xTrain, xx, yTrain, yy = train_test_split(xValues, yValues,test_size=0.4)\n",
    "    xTest, xVal, yTest, yVal = train_test_split(xx,yy, test_size=0.5)\n",
    "    # now sort using x axis\n",
    "    p = xTrain.argsort(axis=0)\n",
    "    xTrain = xTrain[p[:,0]]\n",
    "    yTrain = yTrain[p[:,0]]\n",
    "    p = xVal.argsort(axis=0)\n",
    "    xVal = xVal[p[:,0]]\n",
    "    yVal = yVal[p[:,0]]\n",
    "    p = xTest.argsort(axis=0)\n",
    "    xTest = xTest[p[:,0]]\n",
    "    yTest = yTest[p[:,0]]\n",
    "    \n",
    "    return xTrain, yTrain, xTest, yTest, xVal, yVal\n",
    "\n",
    "np.random.seed(seed)\n",
    "xTrain, yTrain, xTest, yTest, xVal, yVal = trainTestValidationPartition(xGiven, yGiven)\n",
    "\n",
    "# fit a polynomial of given degree\n",
    "def fit(xVals, yVals, xForm, degree):\n",
    "    A = xForm.fit_transform(xVals)\n",
    "    return LinearRegression().fit(A, yVals)\n",
    "\n",
    "def predict(reg, xVals, xForm):\n",
    "    A = xForm.fit_transform(xVals)\n",
    "    return reg.predict(A)\n",
    "    \n",
    "\n",
    "# plot the training and test errors for the models obtained when varying\n",
    "# d\n",
    "errTrainAry = []; errValAry = []\n",
    "for i in range(1,21):\n",
    "    xForm = PolynomialFeatures(i)\n",
    "    reg = fit(xTrain, yTrain, xForm, i)\n",
    "    yTrainPred = predict(reg, xTrain, xForm)\n",
    "    errTrainAry.append(mean_squared_error(yTrain, yTrainPred))\n",
    "    yValPred = predict(reg, xVal, xForm)\n",
    "    errValAry.append(mean_squared_error(yVal, yValPred))\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(range(1,21), errTrainAry, 'bo--', range(1,21), errValAry, 'g*--')\n",
    "ax.set(xlabel='Polynomial degree', ylabel='MSE',\n",
    "    title='Train and test errors')\n",
    "ax.legend(['Train error', 'Validation error'],loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# interactively plot the data we have and the best fit for a\n",
    "# given polynomial order\n",
    "\n",
    "# Utility function to show two functions on the same plot\n",
    "def plotTwoFuncs(x1Vals, y1Vals, x2Vals, y2Vals, err1, err2):\n",
    "    fig, ax = plt.subplots()\n",
    "    ax.plot(x1Vals, y1Vals, 'b.', x2Vals, y2Vals, 'g--')\n",
    "    ax.set(xlabel='x', ylabel='y',\n",
    "       title='Function approximation')\n",
    "    ax.legend(['Training data', 'Approximation'],loc='upper left')\n",
    "    ax.annotate('Training err = ' + format(err1, \".3E\"), xy=(0.125,-2))\n",
    "    ax.annotate('Validation err = ' + format(err2, \".3E\"), xy=(0.125,-3.5))\n",
    "    plt.show()\n",
    "\n",
    "def plotFit(d):\n",
    "    xForm = PolynomialFeatures(d)\n",
    "    reg = fit(xTrain, yTrain, xForm, d)\n",
    "    yTrainPred = predict(reg, xTrain, xForm)\n",
    "    errTrain = mean_squared_error(yTrain, yTrainPred)\n",
    "    yValPred = predict(reg, xVal, xForm)\n",
    "    errVal = mean_squared_error(yVal, yValPred)\n",
    "    # now plot the training data and the predicted data\n",
    "    plotTwoFuncs(xTrain, yTrain, xTrain, yTrainPred, errTrain, errVal)\n",
    "\n",
    "interactive_plot = interactive(plotFit, d=(1, 50))\n",
    "output = interactive_plot.children[-1]\n",
    "output.layout.height = '350px'\n",
    "interactive_plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **Using the plot above, what what order of polyomial would you choose? Why?**\n",
    "- **Now that you have picked a model, find the test error for the model that you trained using the code below. Observe that we only use the test set once**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Evaluate the test error on your order choice\n",
    "degree = 200\n",
    "# degree = your chosen value\n",
    "xForm = PolynomialFeatures(degree)\n",
    "reg = fit(xTrain, yTrain, xForm, degree)\n",
    "yTestPred = predict(reg, xTest, xForm)\n",
    "errTest = mean_squared_error(yTest, yTestPred)\n",
    "print(\"Test error: \" + format(errTest, \".3E\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **Discuss: validation is used to as a method to set hyperparameters in our models. Do you think there are disadvantages to validation?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross-validation\n",
    "- Validation is a tool we often use to tune hyperparameters\n",
    "- Its drawback is that we remove data from our training set. Labeled data can be expensive, so we wonder if there is a way to avoid losing this data\n",
    "- Suppose that you take away $k$ points for validation\n",
    "    - **What happens if you set $k$ too low?**\n",
    "    - **What happens if you set $k$ too high?**\n",
    "\n",
    "- Remember that we used valdation in order to tune a hyperparameter\n",
    "- It would be great if we could train using our entire training set while keeping a local estimate of the generalization error\n",
    "- Here's a way called \"leave one out\" cross validation:\n",
    "    - Suppose you are given data $D$. Partition into two sets $D = D_r \\cup D_e$, the training and test datasets\n",
    "    - For every value $\\phi_m$ of the hyperparameter $\\phi$\n",
    "        - For every point $x_i$ in $D_r$\n",
    "            - Create a training set $D_i = D_r \\setminus \\{x_i\\}$, i.e., we just remove one point\n",
    "            - Train $g_i^m = A(D_i, \\phi_m)$\n",
    "            - Compute the validation error *on the single point left out*: $e_i^m = $ error of $g_i^m$ at $x_i$\n",
    "        - Set $e^m = \\text{mean}_i(e_i^m)$, i.e., the mean of all validation errors on one point\n",
    "        - $e^m$ is a good proxy for the generalization error\n",
    "    - Now pick the parameter $\\phi_m$ that resulted in the lowest $e_m$ and return a model $g = A(D_r, g_m)$\n",
    "    \n",
    "    \n",
    "- **What are the advantages and disadvantages of \"leave one out\" cross-validation?**\n",
    "- **How do you think we could address those disadvantages?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
